18 research outputs found
New scalable machine learning methods: beyond classification and regression
Programa Oficial de Doutoramento en Computación. 5009V01
[Abstract]
The recent surge in available data has spawned a new and promising age of machine
learning. Success cases of machine learning are arriving at an increasing rate, as some
algorithms are able to leverage immense amounts of data to produce difficult yet highly
accurate predictions. Still, many algorithms in the toolbox of the machine learning practitioner
have been rendered useless in this new scenario due to the complications associated with
large-scale learning. Handling large datasets entails logistical problems, limits the computational
and spatial complexity of the algorithms used, favours methods with few or
no hyperparameters to be configured, and exhibits specific characteristics that complicate
learning. This thesis is centered on the scalability of machine learning algorithms,
that is, their capacity to maintain their effectiveness as the scale of the data grows, and
how it can be improved. We focus on problems for which the existing solutions struggle
when the scale grows. Therefore, we set aside classification and regression problems and
focus on feature selection, anomaly detection, graph construction and explainable machine
learning. We analyze four different strategies to obtain scalable algorithms. First,
we explore distributed computation, which is used in all of the presented algorithms.
Besides this technique, we also examine the use of approximate models to speed up
computations, the design of new models that take advantage of a characteristic of the
input data to simplify training, and the enhancement of simple models to enable them
to manage large-scale learning. We have implemented four new algorithms and six
versions of existing ones that tackle the aforementioned problems, and for each one we report
experimental results that show both their validity in comparison with competing
methods and their capacity to scale to large datasets. All the presented algorithms
have been made available for download and are being published in journals to enable
practitioners and researchers to use them.
Scalable Feature Selection Using ReliefF Aided by Locality-Sensitive Hashing
Funded for open-access publication: Universidade da Coruña/CISUG
[Abstract] Feature selection algorithms, such as ReliefF, are very important for processing high-dimensionality data sets. However, the widespread use of such popular and effective algorithms is limited by their computational cost. We describe an adaptation of the ReliefF algorithm that simplifies the costliest of its steps by approximating the nearest-neighbor graph using locality-sensitive hashing (LSH). The resulting ReliefF-LSH algorithm can process data sets that are too large for the original ReliefF, a capability further enhanced by a distributed implementation in Apache Spark. Furthermore, ReliefF-LSH obtains better results and is more generally applicable than currently available alternatives to the original ReliefF, as it can handle regression and multiclass data sets. The fact that it does not require any additional hyperparameters with respect to ReliefF also avoids costly tuning. A set of experiments demonstrates the validity of this new approach and confirms its good scalability.
This study has been supported in part by the Spanish Ministerio de Economía y Competitividad (projects PID2019-109238GB-C2 and TIN2015-65069-C2-1-R and 2-R), partially funded by FEDER funds of the EU, and by the Xunta de Galicia (projects ED431C 2018/34 and Centro Singular de Investigación de Galicia, accreditation 2016-2019). The authors wish to thank the Fundación Pública Galega Centro Tecnolóxico de Supercomputación de Galicia (CESGA) for the use of their computing resources. Funding for open access charge: Universidade da Coruña/CISUG
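The abstract describes replacing ReliefF's exhaustive nearest-neighbour search with an LSH approximation. The sketch below illustrates that idea in miniature, not the paper's implementation: a single random-hyperplane hash table (a real deployment would use several) restricts each point's nearest-hit/nearest-miss search to its own bucket. The function names and the simple weight update are illustrative assumptions.

```python
import numpy as np

def lsh_buckets(X, n_planes=8, seed=0):
    """Group rows of X by random-hyperplane LSH signature (cosine-style LSH)."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(X.shape[1], n_planes))
    # Pack the sign bits of each point's projections into an integer bucket key.
    keys = (X @ planes > 0) @ (1 << np.arange(n_planes))
    buckets = {}
    for i, k in enumerate(keys):
        buckets.setdefault(int(k), []).append(i)
    return buckets

def relieff_lsh(X, y, n_neighbors=3):
    """Approximate ReliefF: nearest hits/misses are searched only inside the
    point's LSH bucket instead of the whole dataset (the costly step)."""
    n, d = X.shape
    w = np.zeros(d)
    for idx in lsh_buckets(X).values():
        for i in idx:
            hits = [j for j in idx if j != i and y[j] == y[i]]
            misses = [j for j in idx if y[j] != y[i]]
            for pool, sign in ((hits, -1.0), (misses, 1.0)):
                if not pool:
                    continue
                dists = np.abs(X[pool] - X[i]).sum(axis=1)
                nearest = np.asarray(pool)[np.argsort(dists)[:n_neighbors]]
                # Hits pull a feature's weight down, misses push it up.
                w += sign * np.abs(X[nearest] - X[i]).mean(axis=0)
    return w / n
```

On data with one informative and one noisy feature, the informative feature should receive the larger weight, which is the behaviour ReliefF's weight update is designed to produce.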
Regression Tree Based Explanation for Anomaly Detection Algorithm
[Abstract]
This work presents EADMNC (Explainable Anomaly Detection on Mixed Numerical and Categorical spaces), a novel approach that adds explainability to ADMNC, an anomaly detection algorithm that provides accurate detections on mixed numerical and categorical input spaces. Our improved algorithm leverages the formulation of the ADMNC model to offer pre-hoc explainability based on CART (Classification and Regression Trees). The explanation is presented as a segmentation of the input data into homogeneous groups that can be described with a few variables, offering supervisors novel information for justifications. To prove scalability and interpretability, we report experimental results on real-world large datasets, focusing on the network intrusion detection domain.
This research was partially funded by European Union ERDF funds, Ministerio de Ciencia e Innovación grant number PID2019-109238GB-C22, and Xunta de Galicia through the accreditation of Centro Singular de Investigación 2016-2020, Ref. ED431G/01, and Grupos de Referencia Competitiva, Ref. GRC2014/035
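The abstract does not detail how the segmentation is derived from the ADMNC formulation, so the following is only a generic illustration of the CART-style idea: pick the (feature, threshold) split that makes the anomaly scores of the two resulting groups most homogeneous (lowest within-group score variance), yielding a group describable by a single condition. The detector score below is a simple stand-in, not ADMNC's.

```python
import numpy as np

def best_split(X, s):
    """CART-style split for a score-explanation tree: choose the (feature,
    threshold) pair minimising the within-group variance of the scores s."""
    best = (np.inf, None, None)
    for f in range(X.shape[1]):
        for t in np.quantile(X[:, f], [0.25, 0.5, 0.75]):  # coarse candidate grid
            left = X[:, f] <= t
            if left.all() or not left.any():
                continue
            cost = s[left].var() * left.sum() + s[~left].var() * (~left).sum()
            if cost < best[0]:
                best = (cost, f, t)
    return best[1], best[2]

# Toy data: a dense inlier cluster plus a small far-away group.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 3)), rng.normal(8, 1, (10, 3))])
score = np.abs(X).sum(axis=1)        # stand-in anomaly score, not ADMNC's
f, t = best_split(X, score)
print(f"anomalous segment described by: feature {f} > {t:.2f}")
```

Recursing on each side of the split (as CART does) would produce the few-variable group descriptions the paper presents to supervisors.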
Sustainable personalisation and explainability in Dyadic Data Systems
[Abstract]: Systems that rely on dyadic data, which relate entities of two different types, have become ubiquitous in fields such as media services, the tourism business and e-commerce, among others. However, these systems have tended to be black boxes, despite their objective of influencing people's decisions. There is a lack of research on providing personalised explanations for the outputs of systems that make use of such data, that is, on integrating the idea of Explainable Artificial Intelligence into the field of dyadic data. Moreover, the existing approaches rely heavily on Deep Learning models for their training, reducing their overall sustainability. In this work, we propose a computationally efficient model that provides personalisation by generating explanations based on user-created images. In the context of a particular dyadic data system, the restaurant review platform TripAdvisor, we predict, for any (user, restaurant) pair, the restaurant review that is most adequate to present to the user, based on their personal preferences. This model combines efficient Matrix Factorisation techniques with the feature-rich embeddings of pre-trained Image Classification models, yielding a method capable of providing transparency to dyadic data systems while reducing the carbon emissions of training by as much as 80% compared to alternative approaches.
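The abstract outlines the recipe only at a high level; a minimal sketch of the general idea, with an assumed learned projection `W` linking frozen image embeddings to matrix-factorisation user factors (not the paper's exact architecture), could look like:

```python
import numpy as np

def rank_images(user_vec, W, image_embs):
    """Rank a restaurant's review photos for one user: project each frozen,
    pre-trained image embedding into the user-factor space via W, then score
    it against the user's matrix-factorisation factors with a dot product."""
    scores = image_embs @ W @ user_vec   # shape: (n_images,)
    order = np.argsort(scores)[::-1]     # best photo first
    return order, scores

# Toy check with hypothetical dimensions: 4 user factors, 8 embedding dims.
k, d_emb = 4, 8
rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(d_emb, k))
W[0, 0] = 5.0                      # projection ties embedding dim 0 to factor 0
user = np.array([1.0, 0.0, 0.0, 0.0])   # this user cares about factor 0
imgs = np.eye(3, d_emb)            # three one-hot stand-in "embeddings"
order, _ = rank_images(user, W, imgs)
```

Because the image embeddings come from a frozen pre-trained network, only the small factor matrices and `W` need training, which is where the claimed efficiency gain would come from.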
Fast anomaly detection with locality-sensitive hashing and hyperparameter autotuning
This paper presents LSHAD, an anomaly detection (AD) method based on Locality-Sensitive Hashing (LSH), capable of dealing with large-scale datasets. The resulting algorithm is highly parallelizable, and its implementation in Apache Spark further increases its ability to handle very large datasets. Moreover, the algorithm incorporates an automatic hyperparameter tuning mechanism so that users do not have to perform costly manual tuning. Our LSHAD method is novel in that neither hyperparameter automation nor distributed operation is usual in AD techniques. Our results for experiments with LSHAD across a variety of datasets point to state-of-the-art AD performance while handling much larger datasets than state-of-the-art alternatives. In addition, evaluation results for the tradeoff between AD performance and scalability show that our method offers significant advantages over competing methods.
This research has been financially supported in part by the Spanish Ministerio de Economía y Competitividad (project PID-2019-109238GB-C22) and by the Xunta de Galicia (grants ED431C 2018/34 and ED431G 2019/01) through European Union ERDF funds. CITIC, as a research center accredited by the Galician University System, is funded by the Consellería de Cultura, Educación e Universidades of the Xunta de Galicia, supported 80% through ERDF Funds (ERDF Operational Programme Galicia 2014-2020) and 20% by the Secretaría Xeral de Universidades (Grant ED431G 2019/01). This work was also supported by National Funds through the Portuguese FCT - Fundação para a Ciência e a Tecnologia (projects UIDB/00760/2020 and UIDP/00760/2020).
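The abstract leaves out the mechanics, so the following is a hedged sketch of a common LSH-for-AD heuristic rather than LSHAD itself (and it omits the paper's hyperparameter autotuning and Spark distribution): points whose hash buckets are sparsely populated across several independent tables have few similar neighbours and therefore receive high anomaly scores.

```python
import numpy as np
from collections import Counter

def lsh_anomaly_scores(X, n_tables=10, n_planes=6, seed=0):
    """Score each row of X by the rarity of its LSH bucket, averaged over
    several independent random-hyperplane hash tables. Rare bucket ->
    few similar points -> high anomaly score."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(n_tables):
        planes = rng.normal(size=(d, n_planes))
        # Bucket key: sign pattern of the projections, packed into an integer.
        keys = ((X @ planes > 0) @ (1 << np.arange(n_planes))).tolist()
        counts = Counter(keys)
        # Rarity = negative log of the fraction of points sharing the bucket.
        scores += -np.log(np.array([counts[k] for k in keys]) / n)
    return scores / n_tables
```

Each table is an independent pass over the data, which is what makes this style of scoring embarrassingly parallel and a natural fit for a distributed runtime.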
Sustainable Transparency in Recommender Systems: Bayesian Ranking of Images for Explainability
Recommender Systems have become crucial in the modern world, commonly guiding
users towards relevant content or products, and having a large influence over
the decisions of users and citizens. However, ensuring transparency and user
trust in these systems remains a challenge; personalized explanations have
emerged as a solution, offering justifications for recommendations. Among the
existing approaches for generating personalized explanations, using visual
content created by the users is one particularly promising option, showing a
potential to maximize transparency and user trust. Existing models for
explaining recommendations in this context face limitations: sustainability has
been a critical concern, as they often require substantial computational
resources, leading to significant carbon emissions comparable to the
Recommender Systems where they would be integrated. Moreover, most models
employ surrogate learning goals that do not align with the objective of ranking
the most effective personalized explanations for a given recommendation,
leading to a suboptimal learning process and larger model sizes. To address
these limitations, we present BRIE, a novel model designed to tackle the
existing challenges by adopting a more adequate learning goal based on Bayesian
Pairwise Ranking, enabling it to achieve consistently superior performance to
state-of-the-art models on six real-world datasets, while exhibiting remarkable
efficiency, emitting up to 75% less CO2 during training and inference with
a model up to 64 times smaller than previous approaches.
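The abstract names Bayesian Pairwise Ranking as the learning goal; a minimal sketch of that objective (the classic BPR update, with toy latent factors standing in for BRIE's image-based model) could look like:

```python
import numpy as np

def bpr_step(U, V, u, i, j, lr=0.05, reg=0.01):
    """One SGD step of Bayesian Pairwise Ranking: push the score of the
    item the user interacted with (i) above a non-interacted item (j)."""
    x = U[u] @ (V[i] - V[j])          # current score difference x_uij
    g = 1.0 / (1.0 + np.exp(x))       # sigma(-x): gradient weight of ln sigma(x)
    u_old = U[u].copy()
    U[u] += lr * (g * (V[i] - V[j]) - reg * U[u])
    V[i] += lr * (g * u_old - reg * V[i])
    V[j] += lr * (-g * u_old - reg * V[j])

# Toy run: every user prefers even-indexed items over odd-indexed ones.
rng = np.random.default_rng(0)
U = 0.1 * rng.normal(size=(5, 4))     # 5 users, 4 latent factors
V = 0.1 * rng.normal(size=(10, 4))    # 10 items
for _ in range(3000):
    u = int(rng.integers(5))
    i = int(rng.choice(np.arange(0, 10, 2)))   # a liked (even) item
    j = int(rng.choice(np.arange(1, 10, 2)))   # an unliked (odd) item
    bpr_step(U, V, u, i, j)
S = U @ V.T                            # final user-item score matrix
```

Because the loss directly optimises the pairwise ranking of candidates rather than a surrogate goal, a small factor model suffices, which is consistent with the efficiency argument made in the abstract.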
Los docentes que no han dejado de ser alumnos. Retos y experiencias en dos medios diferentes: online vs presencial
In this work, we describe our first teaching experience in two different settings: a face-to-face subject in the Computer Science Degree of the University of A Coruña and an online subject in the Research Master's Degree in Artificial Intelligence of the Menéndez Pelayo International University. The experience of teaching both subjects simultaneously has allowed us to appreciate the differences between the two modes of teaching. We want to show how we solved the challenges posed by these two subjects, so that the reader can benefit from our brief but intense teaching adventures.
Multithreaded and Spark parallelization of feature selection filters
©2016 Elsevier B.V. All rights reserved. This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/bync-nd/4.0/. This version of the article has been accepted for publication in Journal of Computational Science. The Version of Record is available online at https://doi.org/10.1016/j.jocs.2016.07.002
Final accepted version of: C. Eiras-Franco, V. Bolón-Canedo, S. Ramos, J. González-Domínguez, A. Alonso-Betanzos, and J. Touriño, "Multithreaded and Spark parallelization of feature selection filters", Journal of Computational Science, Vol. 17, Part 3, Nov. 2016, pp. 609-619
[Abstract]: Vast amounts of data are generated every day, constituting a volume that is challenging to analyze. Techniques such as feature selection are advisable when tackling large datasets. Among the tools that provide this functionality, Weka is one of the most popular, although the implementations it provides struggle when processing large datasets, requiring impractical amounts of time. Parallel processing can help alleviate this problem, effectively allowing users to work with Big Data. The computational power of multicore machines can be harnessed through multithreading and distributed programming, helping to tackle larger problems. Both techniques can dramatically speed up the feature selection process, allowing users to work with larger datasets. The reimplementation of four popular feature selection algorithms included in Weka is the focus of this work. Multithreaded implementations previously not included in Weka, as well as parallel Spark implementations, were developed for each algorithm. Experimental results obtained from tests on real-world datasets show that the new versions offer significant reductions in processing times.
This work has been financed in part by Xunta de Galicia under Research Network R2014/041 and project GRC2014/035, and by the Spanish Ministerio de Economía y Competitividad under projects TIN2012-37954 and TIN2015-65069-C2-1-R, partially funded by FEDER funds of the European Union. V. Bolón-Canedo acknowledges the support of the Xunta de Galicia under postdoctoral grant ED481B 2014/164-0. Additionally, the collaboration of Jorge Veiga in setting up and using the MREv tool for Spark execution was essential for this work.
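The paper's implementations target Weka and Apache Spark; as a toy illustration of the multithreaded pattern only, the sketch below scores features concurrently in a thread pool using a simple correlation filter (a stand-in, not one of the four reimplemented algorithms). Each column's score is independent of the others, which is what makes filter-style feature selection so amenable to parallelization.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def feature_score(X, y, col):
    """Univariate filter score: |Pearson correlation| of one feature with
    the target (a stand-in for the Weka filters discussed in the paper)."""
    return abs(float(np.corrcoef(X[:, col], y)[0, 1]))

def rank_features_parallel(X, y, workers=4):
    """Score every feature in a thread pool and rank best-first. NumPy
    releases the GIL inside heavy array ops, so threads genuinely overlap."""
    with ThreadPoolExecutor(max_workers=workers) as ex:
        scores = list(ex.map(lambda c: feature_score(X, y, c), range(X.shape[1])))
    return list(np.argsort(scores)[::-1])
```

A Spark version of the same pattern would map the scoring function over a distributed collection of column indices instead of a local thread pool.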
Proceedings of the 8th International Conference on Data Science, Technology and Applications (DATA 2019)
[Abstract] The aim of this work is to propose different statistical and machine learning methodologies for identifying
anomalies and controlling the quality of energy efficiency and hygrothermal comfort in buildings. Companies
in the building energy sector are interested in statistical and machine learning tools that automate
the control of energy consumption and ensure the quality of Heating, Ventilation and Air Conditioning (HVAC) installations.
Consequently, a methodology based on the application of the Local Correlation Integral (LOCI)
anomaly detection technique has been proposed. In addition, the variables most critical for anomaly detection
are identified using the ReliefF method. Once vectors of critical variables are obtained, multivariate and
univariate control charts can be applied to control the quality of HVAC installations (consumption, thermal
comfort). In order to test the proposed methodology, the companies involved in this project have provided
a case study of a store of a clothing brand located in a shopping center in Panama. It is important to note
that this is a controlled case study for which all the anomalies had been previously identified by maintenance
personnel. Moreover, as an alternative solution, in addition to machine learning and multivariate techniques,
new nonparametric control charts for functional data based on data depth have been proposed and applied to
curves of daily energy consumption in HVAC.
Ministerio de Asuntos Económicos y Transformación Digital; MTM2014-52876-R. Ministerio de Asuntos Económicos y Transformación Digital; MTM2017-82724-R. Xunta de Galicia; ED431C-2016-015. Centro Singular de Investigación de Galicia; ED431G/01 2016-19. Centro de Investigación en Tecnoloxías da Información e as Comunicacións da Universidade da Coruña; PC18/03. Escuela Politécnica Nacional of Ecuador; PII-DM-002-201
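As a hedged illustration of the LOCI technique mentioned in the abstract, the sketch below computes a single-radius simplification of the multi-granularity deviation factor (MDEF); the full method sweeps multiple radii and normalises the deviation by its standard deviation, both omitted here.

```python
import numpy as np

def loci_mdef(X, r, alpha=0.5):
    """Single-radius LOCI simplification: MDEF compares each point's
    alpha*r-neighbourhood count against the average such count over its
    r-neighbours. Large MDEF -> locally much sparser than its neighbours."""
    # Full pairwise distance matrix (fine for small X; LOCI indexes this).
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    n_small = (D <= alpha * r).sum(axis=1)      # counts include the point itself
    mdef = np.empty(len(X))
    for i in range(len(X)):
        neigh = np.where(D[i] <= r)[0]          # sampling neighbourhood
        mdef[i] = 1.0 - n_small[i] / n_small[neigh].mean()
    return mdef
```

On HVAC-style monitoring data, points whose local density falls well below that of their neighbourhood (high MDEF) would be the candidate anomalies passed on to the control charts.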
Análisis fisiológico de las tareas de entrenamiento en fútbol sala
It is important to be able to accurately monitor training load during futsal drills intended for physical development, so that training parameters can be optimized. The aim of this study was to analyze the conditional profile of futsal drills. Eight professional futsal players were assessed for heart rate, blood lactate, duration, and intervention-time responses to 8 commonly used futsal training drills. Statistical analysis was performed with SPSS 20.0 and comprised general descriptive statistics and two ANOVAs with Bonferroni correction. The results showed that real-game exercises did not reach the physiological load of matches. Furthermore, speed-endurance drills produced higher lactate concentrations than the other training activities. Finally, transition, mobility, full-field, 4x4 and fly-goalkeeper drills had similar conditional characteristics, close to mixed endurance and the anaerobic threshold. In conclusion, the analyzed drills are adequate for developing the metabolic pathways essential in futsal.